Abstract: In this paper, we address the problem of distributed
interference management of cognitive femtocells that share the
same frequency range with macrocells (primary users) using
distributed multi-agent Q-learning. We formulate and solve three
problems representing three different Q-learning algorithms:
namely, centralized, distributed and partially distributed power
control using Q-learning (CPC-Q, DPC-Q and PDPC-Q). CPC-Q,
although not of practical interest, characterizes the global
optimum. Each of DPC-Q and PDPC-Q works in two different
learning paradigms: Independent (IL) and Cooperative (CL).
The former is considered the simplest form of applying
Q-learning in multi-agent scenarios, where all the femtocells learn
independently. The latter is the proposed scheme, in which
femtocells share partial information during the learning process
in order to strike a balance between practical relevance and
performance. In terms of performance, the simulation results
show that the CL paradigm outperforms the IL paradigm and
achieves an aggregate femtocell capacity that is very close to the
optimal one. In terms of practical relevance, we evaluate the
robustness and scalability of DPC-Q, in real time, by deploying
new femtocells in the system during the learning process, and
show that DPC-Q in the CL paradigm scales to a large
number of femtocells and is more robust to the network dynamics
than the IL paradigm.

I. Introduction 

Femtocells have recently been proposed as a promising
solution to the indoor coverage problem. Although femtocells
offer significant benefits to both the operator and the user,
several challenges have to be solved to fully reap these
benefits. One of the most daunting challenges is their
interference on macro-users and other femtocells [1], [2]. Typically,
femtocells are installed by the end user and hence their
number and positions are random and unknown to the network
operator a priori. Given this, together with the typical dynamics of the
wireless environment, a centralized approach to the
interference problem is not feasible, which in turn calls
for distributed interference management strategies.

Based on these observations, in this paper, we focus on
closed access femtocells [3] working in the same bandwidth
as the macrocells (i.e. cognitive femtocells), where the
femtocells are the secondary users that perform power
control to maximize their own performance while maintaining
the macrocell capacity at a certain level. In order to handle
the interference generated by the femtocells on the macrocell 
users, we will use a distributed reinforcement learning |3] 
technique called multi-agent Q-learning fS) and JS). In our 



context, a prior model of the environment cannot be obtained
due to 1) the unplanned placement of the femtocells and 2)
the typical dynamics of the wireless environment. In such a
context, Q-Learning offers significant advantages to achieve 
optimal decision policies through realtime learning of the 
environment Q- 

In the literature, Q-learning has been used several times to
perform power allocation in femtocell networks. In [8], the authors
used independent learning (IL) Q-learning to perform power
allocation in order to control the interference generated by the
femtocells on the macrocell user. In [7], the authors introduced a
new concept called docitive femtocells, where a new femtocell
can speed up its learning process by learning the policies acquired
by the already deployed femtocells instead of learning from
scratch. The policies are shared by Q-table exchange between
the femtocells. However, after the Q-tables are exchanged, all
the femtocells take their actions (powers) independently, which
may generate an oscillating behavior in the system. In [9], we
developed a distributed power allocation algorithm called
distributed power control using Q-learning (DPC-Q). In DPC-Q,
two different learning paradigms were proposed: independent
learning (IL) and cooperative learning (CL). It was shown
that both paradigms achieve convergence. Moreover, the CL
paradigm outperforms the IL one by achieving a higher
aggregate femtocell capacity and better fairness (in terms of
capacity) among the learning femtocells.

However, in [9] we did not evaluate the performance of
DPC-Q against the network dynamics, especially after convergence.
Also, we did not have any benchmarking algorithm to
compare the performance of DPC-Q against. Thus, the contributions
of this paper can be summed up as follows:

• we propose two new Q-learning based power allocation
algorithms: namely, centralized power control using
Q-learning (CPC-Q) and partially distributed power
control using Q-learning (PDPC-Q). CPC-Q is used for
benchmarking purposes, where a central controller, which
has all the information about the system (channel gains
of all femtocells, system noise, ...), is responsible for
calculating the optimal powers that the femtocells should
use. PDPC-Q, which is an agent based power control
algorithm, is proposed because 1) it gives the operator the
flexibility to work on a global basis (e.g. aggregate
femtocell capacity instead of subcarrier based femtocell
capacity as in DPC-Q), and 2) it makes DPC-Q comparable
to CPC-Q.

• we evaluate the robustness and scalability of DPC-Q, in 
both IL and CL paradigms, against two of the dynamics 
that typically exist in the wireless environment: namely, 
the random activity of femtocells (when new femtocells 
are deployed in the system during the learning process) 
and the density of the femtocells in the macrocell cov- 
erage area (the number of femtocells that are interfering 
on the macro users). 

• we compare our proposed DPC-Q in both IL and CL
paradigms to the idea of docitive femtocells presented in
[7].

The rest of this paper is organized as follows. In Section
II, the system model is described. Section III presents a
brief background on multi-agent Q-learning. In Section IV,
the proposed Q-learning based power allocation algorithms
are presented. The simulation scenario and the results are
discussed in Section V. Finally, the conclusions are drawn in
Section VI.

II. System Model 

We consider a wireless network composed of one macrocell,
served by a Macro Base Station (MBS) with a single transmit and
receive antenna, that coexists with $N_f$ femtocells, each served
by a Femto Base Station (FBS) with a single transmit and receive
antenna. The $N_f$ femtocells are placed indoors within
the macrocell coverage area. Both the MBS and the FBSs
transmit over the same K subcarriers, where orthogonal downlink
transmission is assumed. $U_m$ macro users and $U_f$ femto
users are located randomly inside the macro and femto cells,
respectively. Femtocells within the same range can share
partial information during the learning process to enhance their
performance.

$p_0^{(k)}$ and $p_n^{(k)}$ denote the transmission powers of the MBS
and FBS n on subcarrier k, respectively. Moreover, the maximum
transmission powers for the MBS and any FBS n are
$P_{\max}^m$ and $P_{\max}^f$ respectively, where
$\sum_{k=1}^{K} p_0^{(k)} \le P_{\max}^m$ and $\sum_{k=1}^{K} p_n^{(k)} \le P_{\max}^f$.

The system performance is analyzed in terms of the capacity,
measured in bits/sec/Hz. The capacity achieved by the MBS
at its associated user on subcarrier k is:

\[ C_0^{(k)} = \log_2\left(1 + \frac{h_{00}^{(k)} p_0^{(k)}}{\sum_{n=1}^{N_f} h_{n0}^{(k)} p_n^{(k)} + \sigma^2}\right) \tag{1} \]

where $h_{00}^{(k)}$ indicates the channel gain between the
transmitting MBS and its associated user on subcarrier k; $h_{n0}^{(k)}$
indicates the channel gain between FBS n transmitting on
subcarrier k and the macro user. Finally, $\sigma^2$ indicates the noise
power. The capacity achieved by FBS n at its associated user
on subcarrier k is:

\[ C_n^{(k)} = \log_2\left(1 + \frac{h_{nn}^{(k)} p_n^{(k)}}{h_{0n}^{(k)} p_0^{(k)} + \sum_{n' \ne n} h_{n'n}^{(k)} p_{n'}^{(k)} + \sigma^2}\right) \tag{2} \]

where $h_{nn}^{(k)}$ indicates the channel gain between FBS n
transmitting on subcarrier k and its associated user; $h_{0n}^{(k)}$
indicates the channel gain between the MBS and the femto user
associated with FBS n; $h_{n'n}^{(k)}$
indicates the channel gain between FBS n' transmitting on
subcarrier k and the femto user associated with FBS n.

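
As an illustrative sketch (not part of the original simulation code), the two capacity expressions can be evaluated as follows. The function and argument names are hypothetical, all quantities are assumed to be in linear units, and the interference seen by a femto user is assumed to include the MBS signal in addition to the other FBSs.

```python
import math


def macro_capacity(p_mbs, h_mbs_to_user, p_femto, h_fbs_to_macro_user, noise):
    """Capacity (bits/sec/Hz) of the MBS at its user on one subcarrier.

    p_femto[n] is the power of FBS n and h_fbs_to_macro_user[n] is the
    gain from FBS n to the macro user (hypothetical argument names).
    """
    interference = sum(h * p for h, p in zip(h_fbs_to_macro_user, p_femto))
    return math.log2(1.0 + h_mbs_to_user * p_mbs / (interference + noise))


def femto_capacity(n, p_mbs, h_mbs_to_user, p_femto, h_fbs_to_user, noise):
    """Capacity (bits/sec/Hz) of FBS n at its own user on one subcarrier.

    h_fbs_to_user[i] is the gain from FBS i to the user of FBS n, and
    h_mbs_to_user is the gain from the MBS to that same user (assumed).
    """
    signal = h_fbs_to_user[n] * p_femto[n]
    interference = h_mbs_to_user * p_mbs + sum(
        h * p
        for i, (h, p) in enumerate(zip(h_fbs_to_user, p_femto))
        if i != n
    )
    return math.log2(1.0 + signal / (interference + noise))
```

Keeping the per-subcarrier capacities as plain functions makes it easy to sum them over k when an aggregate capacity is needed.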
III. Multi-agent Q-learning 

The scenario of distributed cognitive femtocells can be
mathematically formulated using stochastic games [10], where
the learning process of each femtocell is described by a task
defined by the quintuple $\{N, S, A, P, R(s, \mathbf{a})\}$, where:

• $N = \{1, 2, \dots, N_f\}$ is the set of agents (i.e. femtocells).

• $S = \{s_1, s_2, \dots, s_m\}$ is the set of possible states that
each agent can occupy, where m is the number of possible
states.

• $A = \{a_1, a_2, \dots, a_l\}$ is the set of possible actions that
each agent can perform for each task, where l is the
number of possible actions.

• P is the probabilistic transition function that defines the
probability that an agent transits from one state to another,
given the joint action performed by all agents.

• $R(s, \mathbf{a})$ is the reward function that determines the reward
fed back to an agent n by the environment when the joint
action $\mathbf{a}$ is performed in state $s \in S$.

In the distributed cognitive femtocells scenario, P cannot
be deduced due to the dynamics of the wireless environment.
Thus, we use Q-learning, one of the most widely used techniques
for calculating optimal policies without any prior model of the
environment. Q-learning assigns each task of each agent a
Q-table whose entries are known as Q-values $Q(s_m, a_l)$, for
each state $s_m \in S$ and action $a_l \in A$. Thus, the dimension
of this table is $m \times l$. The Q-value $Q(s_m, a_l)$ is defined to
be the expected discounted reward over an infinite time when
action $a_l$ is performed in state $s_m$, and an optimal policy is
followed thereafter [jQ. The learning process of each agent 
n at time t can be described as follows: 1) the agent senses
the environment and observes its current state $s_m^n \in S$, 2)
based on $s_m^n$, the agent selects its action $a_l^n$ randomly with
probability $\epsilon$ or according to $a_l^n = \arg\max_{a \in A} Q_t^n(s_m^n, a)$
with probability $1 - \epsilon$, where $Q_t^n(s_m^n, a)$ is the row of the
Q-table of agent n that corresponds to state $s_m^n$ at time t, and $\epsilon$ is
an exploration parameter (a random number) that helps ensure
that every state-action pair of the Q-table is visited at least
once, 3) the environment makes a transition to a new state
$s_{m'}^n \in S$ and the agent receives a reward $r_t^n = R(s_m^n, \mathbf{a})$ due
to this transition, 4) the Q-value is updated using equation (3)
and the process is repeated.

\[ Q_{t+1}^n(s_m^n, a_l^n) := (1 - \alpha)\, Q_t^n(s_m^n, a_l^n) + \alpha \left( r_t^n + \gamma \max_{a' \in A} Q_t^n(s_{m'}^n, a') \right) \tag{3} \]

where $\alpha$ is called the learning rate and $0 < \gamma < 1$ is the
discount factor that determines how much effect the future
rewards have on the decisions at each moment. It should be
noticed that the reward $r_t^n$ depends on the joint action $\mathbf{a}$ of
all agents, not on the individual action $a_l^n$. This is the main
difference between the multi-agent scenario described here 
and the single-agent one (when $N_f = 1$). In the single-agent
case, one of the conditions needed to guarantee that the
Q-values converge to the optimal ones is that the reward of the
agent must depend only on its individual actions (i.e.
the reward function is stationary for each state-action pair)
Is), ifTTl . However, for the multi-agent scenario, the reward 
function is not stationary from the agent point of view, since 
it now depends on the actions of other agents. Thus, the 
convergence proof used for the single-agent case can not be 
used in the multi-agent one. 
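
The learning loop described in this section can be sketched in a few lines of Python. This is a minimal, single-agent view under stated assumptions: the Q-table is a list of rows indexed by hypothetical integer state and action indices, the defaults alpha = 0.5 and gamma = 0.9 mirror the values quoted later for the simulations, and ties in the greedy step are broken by the lowest action index.

```python
import random


def select_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over one row of the Q-table.

    With probability epsilon a random action index is returned,
    otherwise the index of the largest Q-value in the row.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] = (1.0 - alpha) * q_table[state][action] + alpha * target
    return q_table[state][action]
```

In the multi-agent setting each femtocell would hold its own table and call `q_update` with a reward that depends on the joint action, which is exactly why the single-agent convergence argument no longer applies.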

IV. Power Allocation using Q-learning 

In this section, the three proposed Q-learning based power
allocation algorithms are presented.

A. Distributed Power Control Using Q-learning (DPC-Q) 

DPC-Q is a distributed algorithm where multiple agents
(i.e. femtocells) aim at learning a sub-optimal decision policy
(i.e. power allocation) by repeatedly interacting with the
environment. The DPC-Q algorithm is proposed in two different
learning paradigms:

• Independent learning (IL): In this paradigm, each agent 
learns independently from other agents (i.e: ignores other 
agents' actions and considers other agents as part of the 
environment). Although, this may lead to oscillations and 
convergence problems, the IL paradigm showed good 
results in many applications [HI. 

• Cooperative learning (CL): In this paradigm, each agent
shares a portion of its Q-table with all other cooperating
agents¹, aiming at enhancing the femtocells' performance.
CL is performed as follows: each agent shares
the row of its Q-table that corresponds to its current state
with all other cooperating agents (i.e. femtocells in the
same range). Then, each agent n selects its action $a^n$
according to the following equation:

\[ a^n = \arg\max_{a \in A} \left( \sum_{1 \le n' \le N_f} Q^{n'}(s^{n'}, a) \right) \tag{4} \]

The main idea behind this strategy is explained in detail
in [9]. In terms of overhead, if the number of femtocells
is $N_f$, then the total overhead needed is $N_f (N_f - 1)$
messages (each of size $|A|$) per unit time (i.e. the overhead
is quadratic in the number of cooperating femtocells).
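
The cooperative selection rule and its message count can be illustrated with a short sketch. It assumes that every agent has already received the current-state Q-table row of every cooperating agent (its own included) as a list of equal-length lists; the function names are hypothetical and ties are broken by the lowest action index.

```python
def cooperative_action(shared_rows):
    """Return the action index that maximizes the sum of the shared rows.

    shared_rows[n][a] is the Q-value that agent n holds for action a
    in its own current state.
    """
    num_actions = len(shared_rows[0])
    totals = [sum(row[a] for row in shared_rows) for a in range(num_actions)]
    return max(range(num_actions), key=totals.__getitem__)


def cooperation_overhead(num_femtocells):
    """Messages per unit time when each agent sends its row to all others."""
    return num_femtocells * (num_femtocells - 1)
```

For example, with rows `[1.0, 0.0, 0.5]`, `[0.0, 0.2, 0.9]` and `[0.1, 0.3, 0.4]` the first agent alone would prefer action 0, whereas the summed rows favor action 2 for everyone.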
¹We assume that the shared row of the Q-table is put in the control bits
of the packets transmitted between the femtocells. The details of the exact
protocol lie outside the scope of this paper.

DPC-Q is an agent and subcarrier based algorithm (i.e. the
capacities, states, actions and reward functions are defined for
each agent over each subcarrier) [9]:

• Agents: $FBS_n$, $\forall\, 1 \le n \le N_f$

• States: At time t for femtocell n on subcarrier k, the
state is defined as $s_t^{n,k} = \{I_t^k, P_t^n\}$, where $I_t^k \in \{0, 1\}$
indicates the level of interference measured at the macro-user
on subcarrier k at time t:

\[ I_t^k = \begin{cases} 1, & C_0^{(k)} < \Gamma^0 \\ 0, & C_0^{(k)} \ge \Gamma^0 \end{cases} \tag{5} \]

where $\Gamma^0$ is the target capacity determining the QoS
performance of the macrocell. We assume that the macrocell
reports the value of $C_0^{(k)}$ to all FBSs through the backhaul
connection.

$P_t^n$ defines the power levels used to quantize the total
power FBS n is using for transmission at time t:

Pt^y, {pL.-f)<j:k=oPt''<pL. (6) 

where $A_1$ and $A_2$ are arbitrarily selected thresholds (several
values for $A_1$ and $A_2$, as well as more power levels,
were tried in the simulations and the performance
gain between these values was marginal).

• Actions: The action here is scalar, where the set of
actions available for each FBS is defined as the set of
possible powers that an FBS can use for transmission on
each subcarrier. In the simulations, a range from -20 to
$P_{\max}^f$ dBm with a step of 2 dBm is used.

• Reward Functions: The reward fed back to agent n on
subcarrier k at time t is defined as:

\[ r_t^{n,k} = \begin{cases} e^{-(C_0^{(k)} - \Gamma^0)^2} - e^{-C_n^{(k)}}, & \sum_{k=1}^{K} p_t^{n,k} \le P_{\max}^f \\ \text{a negative reward}, & \sum_{k=1}^{K} p_t^{n,k} > P_{\max}^f \end{cases} \tag{7} \]

The rationale behind this reward function is that each
femtocell will aim at maximizing its own capacity while
1) maintaining the capacity of the macrocell around the
target capacity $\Gamma^0$ (convergence is assumed to be within
a range of ±1 bits/sec/Hz from $\Gamma^0$), and 2) not exceeding the
allowed $P_{\max}^f$.

This reward function was compared to the reward function
defined in [9]:

\[ r_t^{n,k} = \begin{cases} e^{-(C_0^{(k)} - \Gamma^0)^2}, & \sum_{k=1}^{K} p_t^{n,k} \le P_{\max}^f \\ -1, & \sum_{k=1}^{K} p_t^{n,k} > P_{\max}^f \end{cases} \tag{8} \]

where it was shown that both reward functions maintain
the capacity of the macrocell within the convergence
range. However, reward function (7) was able to achieve
a higher aggregate femtocell capacity. In this paper, we
show another advantage of reward function (7):
it learns (explores or reacts to network dynamics)
better than reward function (8) even when the exploration
parameter $\epsilon$ is not used. This mainly depends on the initial
value of the Q-values. In this paper, we initially set all
the Q-values to zero. Thus, when $\epsilon$ is not used,
reward function (8) will always feed the agent back with a
positive reward (given that $P_{\max}^f$ is not exceeded). Thus,
if agent n is initially in state $s^{n,k}$ on subcarrier k and
takes action $p^{n,k}$, the Q-value of this action, $Q^n(s^{n,k}, p^{n,k})$,
will be updated using a positive reward, so this
Q-value will increase with time and agent n will keep
using the same action forever (since the action is chosen
according to the maximum Q-value). Thus, using $\epsilon$ with
reward function (8) is a must to obtain good exploration
behavior. On the other hand, reward function (7)
may feed the agent back with positive or negative
rewards ($e^{-(C_0^{(k)} - \Gamma^0)^2}$ could be smaller than $e^{-C_n^{(k)}}$).
Thus, given the same initial conditions, agent n could
receive a negative reward after taking action $p^{n,k}$,
leading to a decrease of its Q-value with time. Once the
Q-value decreases below zero, the agent will take another
action whose Q-value is greater than the decreased one.
Thus, reward function (7) learns (explores) better than
reward function (8).

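
The exploration argument above can be reproduced with a toy experiment. The sketch below is only an illustration under strong simplifying assumptions: a single state, no discounting term, no exploration parameter, zero-initialized Q-values, a power budget that is never exceeded, and a made-up two-action environment (`TOY_ENV`) that maps each action to a hypothetical pair of macrocell and femtocell capacities.

```python
import math

# Hypothetical toy outcomes: action -> (macrocell capacity, femtocell capacity)
TOY_ENV = {0: (9.0, 0.5), 1: (6.0, 3.0)}
TARGET = 6.0  # assumed target macrocell capacity


def reward_r0(c_macro, target=TARGET):
    """First case of reward (8): always positive."""
    return math.exp(-((c_macro - target) ** 2))


def reward_r1(c_macro, c_femto, target=TARGET):
    """First case of reward (7): can be positive or negative."""
    return math.exp(-((c_macro - target) ** 2)) - math.exp(-c_femto)


def greedy_actions(reward_of_action, num_actions, steps, alpha=0.5):
    """Purely greedy single-state learner; returns the actions it takes."""
    q = [0.0] * num_actions
    chosen = []
    for _ in range(steps):
        action = max(range(num_actions), key=q.__getitem__)
        q[action] = (1.0 - alpha) * q[action] + alpha * reward_of_action(action)
        chosen.append(action)
    return chosen


with_r0 = greedy_actions(lambda a: reward_r0(TOY_ENV[a][0]), 2, 5)
with_r1 = greedy_actions(lambda a: reward_r1(*TOY_ENV[a]), 2, 5)
```

With these toy numbers the learner driven by the always-positive reward repeats its first action indefinitely, whereas the learner driven by the signed reward abandons that action after a single negative update and settles on the better one.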
In this paper, we also evaluate the robustness and scalability
of DPC-Q in both IL and CL paradigms. We believe that the
CL paradigm is much more robust and scalable against the
network dynamics than the IL paradigm. The reason
is that after sharing the row of the Q-table, each femtocell
knows the states that all other cooperating femtocells
are occupying, and since a state at a certain moment can
be defined as how the agent sees the environment at that
moment, each femtocell implicitly knows 1) how all other
femtocells can react to the network dynamics, and 2) what actions
other femtocells are going to take. However, if the femtocells
take their actions independently (i.e. IL paradigm), even after
knowing the states of each other, oscillating behaviors that
may not reach convergence may be generated. One way to
overcome this problem is to force the femtocells to make use
of the shared information while taking their actions (i.e. taking
the actions cooperatively using equation (4)). This could decrease the
oscillations in the system, making the femtocells more robust
to the increase in the number of deployed femtocells
and to the sudden effect caused by any newly deployed
femtocell.

B. Partially Distributed Power Control Using Q-learning 
(PDPC-Q) 

PDPC-Q is a partially distributed algorithm: it is a
multi-agent algorithm but only agent dependent (i.e. the states,
actions and reward functions are defined for each agent over all
subcarriers). Like DPC-Q, PDPC-Q works in both IL and CL
paradigms. The agents, states, actions and reward functions
used for the PDPC-Q algorithm are defined as follows:

• Agents: $FBS_n$, $\forall\, 1 \le n \le N_f$

• States: At time t the state is defined as $s_t = \{I_t\}$, where
$I_t \in \{0, 1\}$ indicates the level of interference measured
at the macro-user over all subcarriers at time t:

\[ I_t = \begin{cases} 1, & C_0 < \beta^0 \\ 0, & C_0 \ge \beta^0 \end{cases} \tag{9} \]

where $C_0 = \sum_{k=1}^{K} C_0^{(k)}$ is the aggregate macrocell
capacity and $\beta^0$ is the target aggregate macrocell capacity.

• Actions: For FBS n, the set of actions is defined to be
a set of vectors where each vector represents the powers
FBS n is using on all subcarriers.

• Reward Functions: Reward function (7) can be redefined
as:

\[ r_t^n = e^{-(C_0 - \beta^0)^2} - e^{-C_n} \tag{10} \]

where $C_n = \sum_{k=1}^{K} C_n^{(k)}$ is the aggregate capacity of FBS
n. Note that since PDPC-Q is not subcarrier based, a
power vector in which $P_{\max}^f$ is exceeded will never be
assigned to any FBS. Thus, there is no need for a
negative reward here as in the DPC-Q case. The same goes
for CPC-Q.

TABLE I
Taxonomy of the proposed algorithms.

| | DPC-Q/IL | DPC-Q/CL | CPC-Q | PDPC-Q |
|---|---|---|---|---|
| Complexity | Action is scalar | Action is scalar | $\lvert A \rvert$ grows exponentially in $N_f$ and K | $\lvert A \rvert$ grows exponentially in K |
| Reaction to network dynamics | Inefficient & non-robust | Efficient & robust | | CL is more efficient & robust than IL |
| Scalability | Inefficient at large $N_f$ | Efficient at large $N_f$ | Infeasible at large $N_f$ | CL is more scalable than IL |
| Speed of convergence | Medium convergence | Fast convergence | Slow convergence since $\lvert A \rvert$ is huge | CL is faster than IL |
| Overhead | None | ~ $N_f$ messages, each of size $\lvert A \rvert$ | Huge | CL has larger overhead than IL |

C. Centralized Power Control Using Q-learning (CPC-Q)

CPC-Q is a centralized power control algorithm used to
evaluate the performance of our proposed DPC-Q algorithm.
CPC-Q can be regarded as the single-agent version of
DPC-Q, and hence its convergence to the optimal Q-values,
and thus the optimal powers, is guaranteed. However, using a
centralized controller is not feasible in terms of overhead in
multi-agent scenarios. Thus, CPC-Q works only for small scale
problems. The agent, states, actions and reward functions used
for CPC-Q are defined as follows:

• Agents: A centralized controller.

• States: The same as PDPC-Q.

• Actions: For the central controller, the set of actions
is defined to be a set of matrices where each matrix
represents the powers of all femtocells over all subcarriers.
However, the size of this set grows exponentially
with both the number of femtocells and the number
of subcarriers. Thus, forming the matrices (all possible
actions) from a large set of powers such as the one used
in DPC-Q would be infeasible².

• Reward Functions: Since CPC-Q is global, reward
function (7) can be redefined as:

\[ r_t = e^{-(C_0 - \beta^0)^2} - e^{-C_{femto}} \tag{11} \]

where $C_{femto}$ is defined as $C_{femto} = \sum_{n=1}^{N_f} \sum_{k=1}^{K} C_n^{(k)}$.

Finally, for the rest of the paper, reward function (7) and its
redefinitions (10) and (11) will be referred to as R1, while reward
function (8) will be referred to as R0. The three proposed
algorithms are compared qualitatively in Table I.

V. Performance Evaluation 
A. Simulation Scenario 

We consider a wireless network consisting of one macrocell
serving $U_m = 1$ macro user, underlaid with $N_f$ femtocells.
Each femtocell serves $U_f = 1$ femto-user, which is randomly
located in the femtocell coverage area. All of the macro
and femto cells share the same frequency band composed
of K subcarriers, where orthogonal downlink transmission is
assumed. In the simulations, K changes according to the
algorithm used: for DPC-Q, K = 6, while for both CPC-Q and
PDPC-Q, K = 3. The channel gain between any transmitter i
and any receiver j on subcarrier k is assumed to be path-loss
dominated and is given by:

\[ h_{ij}^{(k)} = d_{ij}^{(-PL)} \tag{12} \]

where $d_{ij}$ is the physical distance between transmitter i and
receiver j, and PL is the path loss exponent. In the simulations,
PL = 2 is used. The distances are calculated according to
the following assumptions: 1) the maximum distance between
the MBS and its associated user is set to 1000 meters, 2)
the maximum distance between the MBS and a femto-user is
set to 800 meters, 3) the maximum distance between an FBS
and its associated user is set to 80 meters, 4) the maximum
distance between an FBS and another femtocell's user is set to
300 meters, 5) the maximum distance between an FBS and the
macro-user is set to 800 meters.

²In the simulations, the set of powers used to form the matrices and the
vectors in CPC-Q and PDPC-Q respectively is {0, 6, 12} dBm.

We used MATLAB on a cluster computing facility with 300
cores to simulate this scenario, where in the simulations we
set the noise power to 10^^, the maximum transmission 
power of the macrocell $P_{\max}^m$ to 43 dBm, the maximum
transmission power of each femtocell $P_{\max}^f$ to 15 dBm, each
of the thresholds $A_1$ and $A_2$ to 5 dBm, the learning
rate $\alpha$ to 0.5, the discount factor $\gamma$ to 0.9 and the random
number e to 0.1 lEl and lH). 

B. Numerical Results 



Figure 1(a) shows the aggregate femtocell capacity (as
a function of the number of femtocells) using CPC-Q and
PDPC-Q with R1 in both IL and CL paradigms. It can be
observed that CL is much better than IL, and that the
aggregate capacity gain of CPC-Q over
PDPC-Q in the CL case is marginal. Since CPC-Q is considered
the single-agent version of DPC-Q, it should converge to
the global optimal values. This is shown in the figure at
small numbers of femtocells ($N_f = 1$ and 2). The optimal
values are calculated using exhaustive search over all possible
actions, where the optimal value is defined to be the maximum
aggregate capacity the system can achieve while maintaining
the capacity of the macrocell in the convergence range (±1
bits/sec/Hz from $\beta^0$). However, starting from $N_f = 3$, CPC-Q
begins to be infeasible since the size of the possible actions
set A becomes very large (at $N_f = 5$: $|A| = 3{,}200{,}000$).
So, besides the computational problems, the condition of
visiting all state-action pairs becomes infeasible. Thus, getting
the optimal value is also not feasible (which is why we stopped
CPC-Q at $N_f = 4$). Note also that we stopped the exhaustive
search at $N_f = 5$ due to complexity and memory problems,
while PDPC-Q is shown at $N_f = 6$ and 7 just to illustrate the
continuity of our algorithm.
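
The growth of the centralized action set can be checked with a short enumeration. The sketch below is an illustration only: it takes the power set {0, 6, 12} dBm, K = 3 and the 15 dBm femtocell budget from the simulation settings, and it assumes that the budget is enforced on the sum of the per-subcarrier powers in linear units (mW) and that infeasible vectors are simply left out of the action set. The function names are hypothetical.

```python
from itertools import product

POWERS_DBM = (0, 6, 12)  # power set quoted for CPC-Q and PDPC-Q
NUM_SUBCARRIERS = 3      # K for CPC-Q and PDPC-Q
P_MAX_DBM = 15           # maximum femtocell transmission power


def dbm_to_mw(dbm):
    """Convert a power in dBm to milliwatts."""
    return 10 ** (dbm / 10)


def feasible_vectors(powers_dbm=POWERS_DBM, k=NUM_SUBCARRIERS, p_max_dbm=P_MAX_DBM):
    """Per-femtocell power vectors whose total power respects the budget."""
    budget_mw = dbm_to_mw(p_max_dbm)
    return [
        vector
        for vector in product(powers_dbm, repeat=k)
        if sum(dbm_to_mw(p) for p in vector) <= budget_mw
    ]


def cpc_q_action_count(num_femtocells):
    """Number of power matrices available to the central controller."""
    return len(feasible_vectors()) ** num_femtocells
```

Under these assumptions 20 of the 27 candidate vectors are feasible for a single femtocell, so the central controller faces 20 ** 5 = 3,200,000 matrices at $N_f = 5$, while a PDPC-Q agent only ever handles the 20 vectors of its own femtocell.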

Fig. 1. Aggregate femtocells capacity using CPC-Q and PDPC-Q with R1
in both IL and CL paradigms. (a) Aggregate femtocells capacity versus the
number of femtocells. (b) Aggregate femtocells capacity versus learning
iterations.

Figures 2 and 3 show the robustness of the proposed DPC-Q
algorithm. In these figures, we started with $N_f = 5$, then
we added a new femtocell after every 4000 iterations³ to reach
$N_f = 29$ at the 96000th iteration. Finally, we added another
femtocell at the 99000th iteration. The figures show that DPC-Q
using the CL paradigm is more robust to the deployment
of new femtocells than the IL paradigm. Moreover,
in these figures we compare the performance of DPC-Q to
the docitive idea presented in [7]. We investigated two cases:
1) the already deployed femtocells share their Q-tables with
the new femtocells when they first join the system (suffixed
with _share in the figure), 2) the newly deployed femtocells
start with zero-initialized Q-tables (suffixed with _scratch
in the figure). Figure 2 shows the macrocell convergence on
a certain subcarrier using DPC-Q with R1 in both IL and CL
paradigms, where it can be observed that the CL paradigm
maintains the macrocell capacity within the range of
convergence (6 ± 1 bits/sec/Hz) and reacts well to the effect of the
newly deployed femtocells, without the need for a new learning
phase every time a new femtocell is deployed, which is
an interesting observation. It can also be observed that our
proposed CL paradigm converges to the same value regardless
of whether the already deployed femtocells share their Q-tables with the
new ones or not. So, sharing could be skipped, thus decreasing
the overall overhead. On the other hand, the IL paradigm
reacted poorly to the network dynamics:
1) convergence is not attained (i.e. an oscillating behavior is
generated), 2) as $N_f$ increases, the IL paradigm may push the
macrocell capacity out of the convergence range as the
network becomes denser. Thus, CL is more scalable than
IL. However, it can be noticed that the docitive idea is useful
in the IL paradigm, where sharing the Q-tables of the already
deployed femtocells with the new ones is much better (in terms
of the value that the macrocell capacity oscillates around) than
beginning with zero-initialized (scratch) Q-tables. In terms of
speed of convergence, it can be noticed that, although the
learning process may need a large number of iterations initially,
CL decreases the dynamics of the learning process and hence
makes it faster. This can be noticed from the figure, where
CL converged at almost the 300th iteration, which is much
earlier than the IL paradigm. Also, after the deployment of
each new femtocell, CL took fewer than 10 iterations (around
0.01 seconds) to re-achieve convergence.

Fig. 2. Convergence of the macrocell capacity over the Q-iterations on a
certain subcarrier where $N_f$ was initially 5, then incremented until $N_f = 30$.

Fig. 3. Aggregate femtocells capacity over the Q-iterations where $N_f$ was
initially 5, then incremented until $N_f = 30$.

Figure 3 shows the aggregate femtocell capacity over
the learning iterations. It can be noticed that using the CL
paradigm, the aggregate capacity increases as more femtocells
are deployed in the network, while in the IL paradigm,
since convergence is not maintained, the aggregate
capacity behaves sporadically, which clearly indicates
that IL does not react efficiently to the network dynamics.
However, in the CL paradigm, from the 64000th to the 72000th
iteration (640th to 720th according to the figure's scale), it can
be noticed that the aggregate capacity decreases. The reason
is that as more femtocells are deployed, the network
becomes very dense, and since the CL paradigm makes
the cooperating femtocells use the same powers, this may force
the macrocell capacity to violate the range of convergence.
Thus, all the femtocells have to decrease their powers
to bring the macrocell capacity back within the range
of convergence, leading to a decrease of their aggregate
capacity. Note that at the 64000th and 72000th iterations, $\epsilon$
has already been removed, which shows that R1 learns well even
when $\epsilon$ is removed.

Finally, in order to compare the aggregate capacity the
CL paradigm achieves, after the incremental deployment of
femtocells, to the ideal value, we used the small scale problem
again. This is shown in Figure 1(b), where we started with
$N_f = 2$ and added an extra femtocell at the 8000th, 12000th
and 13500th iterations⁴. Again, it can be observed that CL
achieves an aggregate capacity that is very close to the optimal
one, while the IL paradigm is far from it.

³In Figures 2 and 3, $\epsilon$ is removed at the 50000th iteration and the figures
were drawn with step = 100 in order to achieve better resolution.

⁴In Figure 1(b), $\epsilon$ is removed at the 12000th iteration and the figure was
drawn with step = 10 in order to achieve better resolution.



VI. Conclusion 

In this paper, three Q-learning based power allocation algorithms
for the cognitive femtocell scenario are presented: namely,
DPC-Q, CPC-Q and PDPC-Q. Although DPC-Q was presented
in previous work, in this paper it is extended, in both of
its learning paradigms (IL and CL), to evaluate its performance,
robustness and scalability. In terms of performance, DPC-Q is
extended to PDPC-Q and then compared to CPC-Q, where the
simulations showed that the CL paradigm outperforms the IL paradigm
and achieves an aggregate femtocell capacity that is very close
to the optimal one. In terms of robustness, the CL paradigm
was found to be much more robust against the deployment of
new femtocells during the learning process, where the results
showed that the CL paradigm outperforms the IL paradigm in
1) maintaining convergence, 2) learning better (i.e. reacting
better to the network dynamics), especially when a suitable
reward function such as the one defined in the simulations is
used, 3) converging to the target capacity regardless of whether the old
femtocells share their experience (i.e. Q-tables) with the newly
deployed ones or not, and 4) speeding up the convergence.
Finally, in terms of scalability, the CL paradigm reacted better to
the network dynamics and maintained convergence, even when
the number of femtocells is large.